Abstract

Multivariate statistics can be applied to a vast number of scientific and technological problems, including analysis,
modeling, hypothesis testing, and prediction. In this work, we develop an introduction to some of the main concepts in
multivariate statistics, including random vectors, joint probability distributions, joint variation measurements (including
covariance and Pearson correlation coefficient), generic multivariate moments, as well as parametric and non-parametric
estimation of multivariate probability distributions.


‘As good luck would have it.’ 

W. Shakespeare. 

1 Introduction 

Probability and statistics are fascinating areas with a wide range of theoretical and applied potential in virtually every scientific and technological area. They can be used for data normalization and enhancement, characterization, analysis, modeling, simulation, hypothesis testing, prediction... The list is particularly long, because almost every scientific and technological problem can be approached from the probabilistic point of view. The very fact that you are reading these lines provides further indication that probability and statistics are interesting and important.

In a previous work, namely CDT-13 [1], a concise introduction to univariate statistics was presented. The term univariate typically refers to random experiments involving a single measurement or random variable. While this type of situation is certainly important, we often deal with random experiments involving several random variables, which gives rise to the concept of multivariate statistics, the main subject developed in the present work.

Figure 1 depicts a random experiment that is being characterized in terms of $M$ respective random variables $X_1, X_2, \ldots, X_M$. At each realization of the random experiment, a whole set of instances of these variables can be observed. As a simple example, one can imagine an apple orchard from which fruits are obtained and have properties such as weight, length, sugar index, etc., measured and represented as a respective random variable $X_i$, $i = 1, 2, \ldots, M$.



Figure 1: A multivariate random experiment being characterized in terms of $M$ respective measurements or random variables $X_1, X_2, \ldots, X_M$, which can be represented as a random vector $\vec{X}$. Each time the random experiment is performed, a respective set of these variables is obtained as samples.

We start by presenting the concept of random vectors, and follow by describing joint (multivariate) probability distributions, multivariate moments, measurements of joint variation of data (correlation, covariance, Pearson correlation coefficient), as well as the interesting problem of parametric and non-parametric multivariate probability distribution estimation.

Multivariate statistics involves long integral and differential expressions, reflecting the multiple random variables involved. In order to provide a more progressive approach, and as a didactic resource, in this text several of the related equations are presented first with respect to only two random variables, $X_1$ and $X_2$, followed by the respective generic expression considering $M$ random variables.

It is strongly recommended that the previous CDT-13 [1] be read first and/or jointly as a preparation for the current text.


2 Random Vectors 

Given a random experiment involving $M$ measurements or random variables $X_i$, each of its realizations can be expressed in terms of a respective random vector $\vec{X}$:

$$\vec{X} = \begin{bmatrix} X_1 \\ X_2 \\ \vdots \\ X_M \end{bmatrix} \tag{1}$$


For instance, let’s consider that we are interested in studying an apple orchard. The random experiment can be defined as collecting a sample of $N$ apples on a given autumn day. At each realization of this experiment, performed on different days, we have a new sample of fruits, which can be characterized in terms of $M$ respective measurements, such as weight, length, sugar degree (e.g. Brix index), etc. Each of these measured properties is understood as a respective random variable $X_i$, $i = 1, 2, \ldots, M$.

Multivariate statistics is to a great extent concerned with the characterization and prediction of the values of the random variables constituting the respective random vector $\vec{X}$. An interesting related concept is the ensemble of a multivariate random experiment, corresponding to the complete set of possible respective realizations.
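As a minimal sketch of how observed random vectors are often stored in practice, the block below keeps one realization per row and one random variable per column. The numeric values, the units, and the names `samples`, `x_first` and `weights` are hypothetical and chosen only for illustration; they do not come from any data in this text.

```python
import numpy as np

# Hypothetical measurements (made-up values, not taken from the text):
# each row is one apple (one observed random vector), and the columns are
# weight in grams, length in centimetres and sugar content in degrees Brix.
samples = np.array([
    [152.0, 7.1, 12.5],
    [171.5, 7.8, 13.1],
    [138.2, 6.7, 11.9],
    [165.0, 7.5, 12.8],
])

N, M = samples.shape      # N observations of an M-dimensional random vector
x_first = samples[0]      # one realization of the random vector
weights = samples[:, 0]   # all observed values of the first random variable
```

With this layout, statistics of a single variable are computed along columns, while each row corresponds to one instance of the random vector.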


3 Joint Probability Distributions

One possible means to better understand $\vec{X}$ is to consider the respective joint probability distribution $p(X_1, X_2, \ldots, X_M)$. Indeed, these distributions can provide all the available statistical information about the respective random vectors $\vec{X}$.

Joint (or multivariate) probability distributions can be known a priori, or need to be estimated, such as by using multivariate relative frequency histograms or other methods to be briefly discussed in Section 7.

Any probability distribution $p(X_1, X_2)$ satisfies the following conditions, and every function that satisfies these conditions is a potential candidate for the probability distribution of the random vector in some random experiment:

$$p(X_1, X_2) \geq 0 \quad \text{for any } X_1, X_2; \tag{2}$$

$$\int_{-\infty}^{\infty} \int_{-\infty}^{\infty} p(X_1, X_2) \, dX_1 \, dX_2 = 1 \tag{3}$$

These conditions require that $p(X_1, X_2)$ be non-negative and that the total volume under it be equal to 1.

One of the reasons probability distributions are so important is that they allow us to calculate probabilities of observing intervals of values of the random variables, and they can also be used to define statistical moments (see Section 6). It is important to keep in mind that probability distributions are densities, implying that the probability of observing any specific random vector is null, i.e. $P(X_1 = a, X_2 = b) = 0$, for $a, b \in \mathbb{R}$.

However, it is possible to calculate probabilities of observing random variable values within given intervals, i.e.:

$$P(a_1 \leq X_1 \leq a_2, \; b_1 \leq X_2 \leq b_2) = \int_{a_1}^{a_2} \int_{b_1}^{b_2} p(X_1, X_2) \, dX_2 \, dX_1 \tag{4}$$

which corresponds to the volume between the probability distribution surface and the domain $X_1 \times X_2$ within the region defined by the intervals of interest $[a_1 \leq X_1 \leq a_2]$ and $[b_1 \leq X_2 \leq b_2]$.

When extended to $M$ random variables, the conditions in Equations 2 and 3 become:

$$p(X_1, X_2, \ldots, X_M) \geq 0 \quad \text{for any } X_1, X_2, \ldots, X_M; \tag{5}$$

$$\int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} p(X_1, X_2, \ldots, X_M) \, dX_1 \, dX_2 \ldots dX_M = 1, \tag{6}$$

corresponding to the same requirements as for the case $M = 2$ above.

An important joint probability distribution is the multivariate normal distribution, whose definition for generic $M$ random variables is:

$$g_{\vec{\mu}, K}(\vec{X}) = \frac{1}{\sqrt{(2\pi)^M |K|}} \exp\left( -\frac{1}{2} (\vec{X} - \vec{\mu})^T K^{-1} (\vec{X} - \vec{\mu}) \right) \tag{7}$$

which has as parameters the column vector $\vec{\mu}$ containing the averages of each respective random variable and the data covariance matrix $K$. In the above equation, $|K|$ is the determinant of $K$.

Figure 2 illustrates a bivariate normal distribution ($M = 2$), with $\vec{\mu} = \begin{bmatrix} 0 \\ 0 \end{bmatrix}$ and $K = \begin{bmatrix} 1 & 0.7 \\ 0.7 & 1 \end{bmatrix}$, as a 3D visualization (a) and as an image with level-sets (b).
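The bivariate normal distribution with the parameters quoted for Figure 2 can also be explored numerically. The sketch below evaluates the standard multivariate normal density on a regular grid and approximates the total volume and the probability of a rectangular region by Riemann sums. The function name `mvn_density`, the grid limits, the grid step and the chosen rectangle are assumptions made for illustration only, and the rectangle probability is a coarse approximation.

```python
import numpy as np

def mvn_density(points, mu, K):
    """Evaluate the multivariate normal density at each row of `points`."""
    points = np.atleast_2d(points)
    M = mu.shape[0]
    diff = points - mu
    K_inv = np.linalg.inv(K)
    # Quadratic form (x - mu)^T K^{-1} (x - mu), one value per row.
    quad = np.einsum("ni,ij,nj->n", diff, K_inv, diff)
    norm = np.sqrt((2.0 * np.pi) ** M * np.linalg.det(K))
    return np.exp(-0.5 * quad) / norm

mu = np.array([0.0, 0.0])
K = np.array([[1.0, 0.7], [0.7, 1.0]])

# Regular grid over [-6, 6] x [-6, 6] (limits and resolution are assumptions).
axis = np.linspace(-6.0, 6.0, 241)
step = axis[1] - axis[0]
X1, X2 = np.meshgrid(axis, axis, indexing="ij")
grid = np.column_stack([X1.ravel(), X2.ravel()])
p = mvn_density(grid, mu, K).reshape(X1.shape)

# Riemann-sum approximation of the total volume under the density.
total_volume = p.sum() * step * step

# Coarse Riemann-sum approximation of P(0 <= X1 <= 1, 0 <= X2 <= 1).
inside = (X1 >= 0.0) & (X1 < 1.0) & (X2 >= 0.0) & (X2 < 1.0)
prob_box = p[inside].sum() * step * step

# Height of the density at its peak, located at mu.
peak_value = mvn_density(mu, mu, K)[0]
```

The total volume comes out very close to 1, as required of a probability density, and the largest value on the grid is found at the position given by the average vector.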
Figure 2: Visualization of a multivariate normal distribution for $M = 2$ and specific average column vector $\vec{\mu}$ and data covariance matrix $K$ as a 3D surface (a) and as an image with level-sets (b).

Observe that $\vec{\mu}$ defines the position of the peak of the normal distribution, and $K$ specifies its elongation and inclination. Also, it is interesting to note that each of these two types of visualization provides complementary information about the probability distribution. For instance, while the 3D visualization provides a good overall idea of the surface, including its height, the image gives a good idea of the $(X_1, X_2)$ structure of the distribution.

Given a joint probability distribution $p(X_1, X_2)$, one is often interested in deriving the respective marginal univariate distributions $p(X_1)$ and $p(X_2)$, which can be obtained by integrating over all variables other than the one being considered, i.e.:

$$p(X_1) = \int_{-\infty}^{\infty} p(X_1, X_2) \, dX_2, \qquad p(X_2) = \int_{-\infty}^{\infty} p(X_1, X_2) \, dX_1 \tag{8}$$

Figure 3 depicts the marginal distribution $p(X_1)$ obtained from the normal distribution in Figure 2. In this particular case, the other marginal distribution $p(X_2)$ can be verified to be identical to $p(X_1)$.

Figure 3: The marginal probability distribution $p(X_1)$ obtained from the bivariate normal distribution in Fig. 2.

In the more generic case involving generic $M$, we have:

$$p(X_i) = \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} p(X_1, X_2, \ldots, X_M) \, dX_1 \, dX_2 \ldots dX_{i-1} \, dX_{i+1} \ldots dX_M \tag{9}$$

4 Conditional Distributions

Given a probability distribution $p(X_1, X_2, \ldots, X_M)$, we may be interested in fixing one of the random variables, i.e. $X_i = c$ for some real constant $c$, in order to obtain a new probability distribution with $M - 1$ random variables. This act of fixing a given variable is related to the concept of conditional probability.

In the case of bivariate distributions, the above objective can be achieved by applying the following equation while making $X_2 = c$:

$$p(X_1 \mid X_2 = c) = \frac{p(X_1, X_2 = c)}{\int_{-\infty}^{\infty} p(X_1, X_2 = c) \, dX_1} \tag{10}$$

which ensures proper normalization of the obtained probability distribution, in the sense that its total area will be equal to 1.

5 Correlation, Covariance, and Pearson Correlation Coefficient


One of the main senses in which multivariate statistics differs from its univariate counterpart is that, since we have more than a single random variable, it becomes necessary to consider possible interactions between them, e.g. in the sense of their joint variation.

For instance, in the case of generic fruits, it is reasonable to expect that their weight will tend to increase with their volume, which provides a possible example of positive joint variation. On the other hand, it could be expected that the lower the temperature, the higher the heating expenses, which characterizes a negative joint variation.

Because we are often interested in considering joint variations, it is important to have more objective, mathematical means for quantifying this property. Three concepts are often adopted for this purpose: correlation, covariance and Pearson correlation coefficient between two given random variables $X_i$ and $X_j$. Each of them will be presented and discussed in this section.

The simplest of these, here called correlation, is defined from the respective joint probability distribution as:

$$\mathrm{Corr}(X_i, X_j) = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} X_i X_j \, p(X_i, X_j) \, dX_i \, dX_j \tag{11}$$

When a set of observations of the two random variables, $X_{i,p}$ and $X_{j,p}$, $p = 1, 2, \ldots, N$, is available instead of the joint probability $p(X_i, X_j)$, the following equation can be considered for obtaining an estimate of the correlation between $X_i$ and $X_j$:

$$\mathrm{Corr}(X_i, X_j) \approx \frac{1}{N} \sum_{p=1}^{N} X_{i,p} X_{j,p} \tag{12}$$

Figure 4 illustrates the interpretation of correlation for 3 possible situations: a circular uniform distribution of points centered at the coordinate origin, the previous distribution centered at (3,3), and the former distribution elongated by a factor of 2 along the $X_1$ axis.

In Figure 4(a), we have a uniformly distributed set of points with circular borders centered at the origin of the coordinate system. As the points are uniformly distributed around (0,0), there will be a strong tendency for the products $X_i X_j$ implied by Equation 12 for the points in the quadrant ($X_i > 0$, $X_j > 0$) to cancel with the products obtained for the quadrant ($X_i < 0$, $X_j > 0$), a similar trend occurring between the quadrants ($X_i > 0$, $X_j < 0$) and ($X_i < 0$, $X_j < 0$). As a consequence, the resulting correlation value could be expected to be very small. This is precisely what happens for the points in Figure 4(a).

However, when the point distribution in Figure 4(a) is shifted to (3,3), resulting in the situation shown in Figure 4(b), the above mentioned products no longer tend to cancel, and a relatively high correlation value is obtained. A similar situation is verified for the point distribution in Figure 4(c), yielding a slightly larger correlation value as a consequence of the scaling along the $X_1$ axis.

Another possibility to quantify the joint variation between two random variables $X_i$ and $X_j$ is in terms of their covariance, defined as:

$$\mathrm{Cov}(X_i, X_j) = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} (X_i - \mu_{X_i})(X_j - \mu_{X_j}) \, p(X_i, X_j) \, dX_i \, dX_j \tag{13}$$

The (unbiased) estimator for the covariance is given as:

$$\mathrm{Cov}(X_i, X_j) \approx \frac{1}{N - 1} \sum_{p=1}^{N} (X_{i,p} - \mu_{X_i})(X_{j,p} - \mu_{X_j}) \tag{14}$$

where the averages $\mu_{X_i}$ and $\mu_{X_j}$ can be respectively estimated as:

$$\mu_{X_i} \approx \frac{1}{N} \sum_{p=1}^{N} X_{i,p}, \qquad \mu_{X_j} \approx \frac{1}{N} \sum_{p=1}^{N} X_{j,p} \tag{15}$$
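The sample estimates of correlation and covariance can be tried out on synthetic points similar to those described for the circular distributions. This is only a sketch: the correlation estimate is assumed here to be the plain average of the products (a 1/N normalization), the point set is generated by rejection sampling with an arbitrary seed, and the names `correlation` and `covariance` are illustrative.

```python
import numpy as np

def correlation(xi, xj):
    """Sample correlation taken as the mean of the products (1/N assumed)."""
    return np.mean(xi * xj)

def covariance(xi, xj):
    """Unbiased sample covariance: products of deviations divided by N - 1."""
    n = xi.shape[0]
    return np.sum((xi - xi.mean()) * (xj - xj.mean())) / (n - 1)

rng = np.random.default_rng(0)   # fixed seed so the sketch is reproducible

# Points uniformly distributed inside a unit disc centred at the origin
# (rejection sampling from the enclosing square).
pts = rng.uniform(-1.0, 1.0, size=(20000, 2))
pts = pts[np.sum(pts**2, axis=1) <= 1.0]
x1, x2 = pts[:, 0], pts[:, 1]

corr_origin = correlation(x1, x2)               # products tend to cancel
corr_shifted = correlation(x1 + 3.0, x2 + 3.0)  # dominated by 3 * 3 = 9
cov_origin = covariance(x1, x2)
cov_shifted = covariance(x1 + 3.0, x2 + 3.0)    # translation does not matter
```

The correlation changes markedly when the points are moved to (3,3), while the covariance of the shifted points coincides with that of the original points up to floating-point error.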


Figure 4 also shows the covariances obtained for each of the three considered cases. We observe that, as a consequence of the subtraction of the mean of each of the variables in Equation 14, the covariance is identical for situations (a) and (b), which differ only by their center of mass (given by the averages of $X_i$ and $X_j$). However, the elongation of the point distribution in (c) still influences the covariance, which is different for this case. Informally speaking, we could say that the covariance does not ‘sense’ the position of the points, being invariant to translation.

The third joint variation statistical measurement considered in this work is the Pearson correlation coefficient, given as

$$\mathrm{PCorr}(X_i, X_j) = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} \frac{X_i - \mu_{X_i}}{\sigma_{X_i}} \, \frac{X_j - \mu_{X_j}}{\sigma_{X_j}} \, p(X_i, X_j) \, dX_i \, dX_j \tag{16}$$

which can be estimated from samples of $X_i$ and $X_j$ as:

$$\mathrm{PCorr}(X_i, X_j) \approx \frac{1}{N - 1} \sum_{p=1}^{N} \frac{X_{i,p} - \mu_{X_i}}{\sigma_{X_i}} \, \frac{X_{j,p} - \mu_{X_j}}{\sigma_{X_j}} \tag{17}$$

where the two standard deviations can be estimated as $\sigma_{X_i} = +\sqrt{\mathrm{Cov}(X_i, X_i)}$ and $\sigma_{X_j} = +\sqrt{\mathrm{Cov}(X_j, X_j)}$.

The standardized version of a random variable $X_i$ can be obtained as:

$$\tilde{X}_i = \frac{X_i - \mu_{X_i}}{\sigma_{X_i}} \tag{18}$$

Interestingly, the Pearson correlation coefficient between $X_i$ and $X_j$ can be understood as the correlation (i.e. Equation 12) between the respective standardized versions of those two original variables. Also, observe that the Pearson correlation coefficient is bounded within the interval $[-1, 1]$.

Figure 4: Values of correlation, covariance and Pearson correlation coefficient for three point distributions: (a) circular centered at (0, 0), (b) the former centered at (3, 3), and (c) the former elongated by a factor of 2 along the $X_1$ axis. The correlation values vary for each case, the covariance is not influenced by the translation in (b), and the Pearson correlation coefficient is the same for every case.

The Pearson correlation coefficient is illustrated for the situations in Figure 4, resulting in identical values in all cases. In particular, the scaling by 2 along the $X_1$ axis in (c) was normalized by the divisions by the standard deviations, being therefore overlooked by the Pearson correlation coefficient.
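The insensitivity of the Pearson correlation coefficient to translation and scaling can be checked with a short sketch. The synthetic data, the seed and the function name `pearson` are assumptions for illustration; the estimator standardizes each variable with the sample mean and the unbiased standard deviation and then divides the sum of products by N - 1. NumPy's `np.corrcoef` is used only as an independent cross-check.

```python
import numpy as np

def pearson(xi, xj):
    """Pearson correlation coefficient from standardized samples."""
    n = xi.shape[0]
    # Standard deviations taken as the square root of the unbiased variance.
    zi = (xi - xi.mean()) / xi.std(ddof=1)
    zj = (xj - xj.mean()) / xj.std(ddof=1)
    return np.sum(zi * zj) / (n - 1)

rng = np.random.default_rng(1)   # arbitrary seed for reproducibility
x1 = rng.normal(size=500)
x2 = 0.7 * x1 + rng.normal(scale=0.5, size=500)

r = pearson(x1, x2)
# Scale X1 by 2 and shift both variables by 3: the coefficient is unchanged.
r_transformed = pearson(2.0 * x1 + 3.0, x2 + 3.0)
# Cross-check against NumPy's built-in estimate.
r_numpy = np.corrcoef(x1, x2)[0, 1]
```

Both the transformed data and the built-in routine yield the same coefficient up to floating-point error.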

Given a more generic situation involving $M$ random variables, it is useful to define a matrix for each of the above three measurements of joint variation. For instance, in the case of the covariance, we can define the covariance matrix of the variables $X_1, X_2, \ldots, X_M$ as

$$K_{i,j} = \mathrm{Cov}(X_i, X_j) \tag{19}$$

Observe that the three respective matrices, as a consequence of their respective definitions, are necessarily symmetric.
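A covariance matrix can be estimated from a table of samples in a few lines. The following sketch assumes one observation per row and one random variable per column, uses synthetic data with an arbitrary seed, and introduces the hypothetical helper `covariance_matrix`; the comparison with `np.cov` is only a consistency check.

```python
import numpy as np

def covariance_matrix(samples):
    """Matrix whose entry (i, j) is the unbiased sample covariance of
    columns i and j of `samples` (one observation per row)."""
    n = samples.shape[0]
    centered = samples - samples.mean(axis=0)
    return centered.T @ centered / (n - 1)

rng = np.random.default_rng(2)
samples = rng.normal(size=(200, 3))   # synthetic data: N = 200, M = 3
samples[:, 2] += samples[:, 0]        # make the third variable follow the first

K = covariance_matrix(samples)
is_symmetric = np.allclose(K, K.T)
matches_numpy = np.allclose(K, np.cov(samples, rowvar=False))
```

The diagonal of the resulting matrix holds the variances of the individual variables, and the symmetry follows directly from the product of deviations being commutative.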


6 Multivariate Moments 

The correlation and covariance statistical measurements 
of joint variation discussed in the previous section can be 
understood as particular cases of the more general concept 
of statistical moments. 

The $M^{[k_1, k_2]}$ moment of the random variables $X_1$ and $X_2$ can be defined as:

$$M^{[k_1, k_2]}(X_1, X_2) = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} X_1^{k_1} X_2^{k_2} \, p(X_1, X_2) \, dX_1 \, dX_2 \tag{20}$$

The correlation as seen above can be understood as the $M^{[k_1 = 1, k_2 = 1]}$ moment of the variables $X_i$ and $X_j$.

In the more general case of $M$ random variables, we have:

$$M^{[k_1, k_2, \ldots, k_M]}(X_1, X_2, \ldots, X_M) = \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} X_1^{k_1} X_2^{k_2} \ldots X_M^{k_M} \, p(X_1, X_2, \ldots, X_M) \, dX_1 \, dX_2 \ldots dX_M \tag{21}$$

The centered version of the above moments can be defined as:

$$C^{[k_1, k_2, \ldots, k_M]}(X_1, X_2, \ldots, X_M) = \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} (X_1 - E[X_1])^{k_1} (X_2 - E[X_2])^{k_2} \ldots (X_M - E[X_M])^{k_M} \, p(X_1, X_2, \ldots, X_M) \, dX_1 \, dX_2 \ldots dX_M \tag{22}$$

Now, we have that the covariance between two random variables $X_i$ and $X_j$ corresponds to their $C^{[k_1 = 1, k_2 = 1]}$ moment. The other moments above are also potentially interesting and useful, though tending to have a less direct interpretation.

7 Parametric and Non-Parametric Estimation


As with univariate probability distributions, multivariate distributions can be specified from a formula $p(X_1, X_2, \ldots, X_M)$ or from samples of the involved variables $X_1, X_2, \ldots, X_M$. The former is only possible through theoretical means. When we have a set of samples of the random variables, we need to consider a respective parametric or non-parametric method. In this section, we briefly present, through illustrations, these two possibilities.

First, consider the table of values sampled for two respective random variables $X_1$ and $X_2$.

Table 1: Samples of random variables $X_1$ and $X_2$ considered in the estimation examples.

sample     X_1       X_2
1          0.42      0.58
2         -0.31     -0.49
3          0.01     -0.22
4          0.93      0.78
5          0.71      0.59
6          0.88      0.69
7          0.47      0.38
8          0.20     -0.15


We have the points, and we need a respective estimation of the probability density. The parametric way is to assume some type of probability distribution, let’s say multivariate normal, and to estimate its respective parameters, namely the average vector and the data covariance matrix. By using Equations 15 and 14 we obtain:

$$\vec{\mu} = \begin{bmatrix} 0.413 \\ 0.270 \end{bmatrix} \tag{23}$$

and

$$K = \begin{bmatrix} 0.186 & 0.199 \\ 0.199 & 0.234 \end{bmatrix} \tag{24}$$

which completely define a bivariate normal distribution (Equation 7) that corresponds to the estimation based on the 8 samples.
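The parametric estimation can be reproduced directly from the eight samples in Table 1. The sketch below computes the sample mean and the unbiased covariance matrix (NumPy's `np.cov` divides by N - 1 by default) and then plugs them into the standard bivariate normal density. The names `table1`, `mu_hat`, `K_hat` and `fitted_density` are illustrative choices.

```python
import numpy as np

# Samples from Table 1 (columns: X_1, X_2).
table1 = np.array([
    [0.42, 0.58], [-0.31, -0.49], [0.01, -0.22], [0.93, 0.78],
    [0.71, 0.59], [0.88, 0.69], [0.47, 0.38], [0.20, -0.15],
])

mu_hat = table1.mean(axis=0)           # sample averages of X_1 and X_2
K_hat = np.cov(table1, rowvar=False)   # unbiased covariance matrix (N - 1)

def fitted_density(x):
    """Bivariate normal density with the estimated parameters."""
    diff = np.asarray(x, dtype=float) - mu_hat
    quad = diff @ np.linalg.inv(K_hat) @ diff
    return np.exp(-0.5 * quad) / (2.0 * np.pi * np.sqrt(np.linalg.det(K_hat)))

density_at_mean = fitted_density(mu_hat)   # highest value of the fitted density
```

The computed average vector and covariance matrix agree with the values reported above to within about 0.001, the small differences being compatible with truncation at the third decimal place.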

Figure 5 depicts the so-obtained bivariate probability distribution in terms of level-sets. A reasonable adherence can be observed between the original samples (shown in orange) and the obtained distribution.

Figure 5: The parametrically estimated bivariate normal probability distribution, illustrated in terms of respective level-sets, obtained for the samples in Table 1, also shown in orange.

Now, we briefly address the non-parametric estimation of multivariate probability distributions. Unlike parametric estimation, here we do not make any assumption about the formula of the likely probability distribution, but instead rely on the distribution of the original points themselves, which are ‘padded’ or statistically interpolated by some means.

A possible method for multivariate non-parametric estimation consists of convolving the original set of points, represented as respective Dirac’s deltas, with a suitable kernel. In a sense, this method can therefore be understood as if the original samples imprinted themselves, through the adopted kernel, into the respectively estimated distribution. The following example illustrates the non-parametric estimation of a possible bivariate probability distribution for the points in Table 1, considering a circularly symmetric normal kernel.

In the absence of additional information about the sought probability distribution, a circularly symmetric normal kernel can be considered. Such a kernel is defined by a covariance matrix of the type:

$$K = \begin{bmatrix} a & 0 \\ 0 & a \end{bmatrix} \tag{25}$$

being characterized by perfectly circular level-sets. Figure 6 illustrates the result of convolving the circularly symmetric normal kernel ($a = 1$) with Dirac’s deltas defined by the position of the samples in Table 1. As expected, the resulting distribution is more circular than that obtained in Figure 5, as a consequence of the adopted circularly symmetric kernel.

Other types of kernels can be considered. For instance, Figure 7 illustrates a non-parametric probability density estimation for the same previous samples but using a normal kernel with covariance matrix equal to:

$$K = \begin{bmatrix} 1 & 0.5 \\ 0.5 & 1 \end{bmatrix} \tag{26}$$
Figure 6: The non-parametrically estimated bivariate probability distribution, illustrated in terms of respective level-sets, obtained for the samples in Table 1, also shown in green. A circularly symmetric normal kernel with $a = 1$ has been adopted.

Figure 7: The non-parametrically estimated bivariate probability distribution, illustrated in terms of respective level-sets, obtained for the samples in Table 1, also shown in green. A normal kernel with covariance matrix given by Eq. 26 has been employed.
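Convolving Dirac's deltas placed at the samples with a normal kernel amounts to summing one copy of the kernel centered at each sample. The sketch below follows that idea for the samples in Table 1 and for the two kernel covariance matrices mentioned above. Dividing the sum by the number of samples, so that the estimate integrates to 1, is an assumption made here, as are the names `kernel_density`, `K_circular` and `K_elongated`.

```python
import numpy as np

# Samples from Table 1 (columns: X_1, X_2).
table1 = np.array([
    [0.42, 0.58], [-0.31, -0.49], [0.01, -0.22], [0.93, 0.78],
    [0.71, 0.59], [0.88, 0.69], [0.47, 0.38], [0.20, -0.15],
])

def kernel_density(x, samples, K):
    """Average of bivariate normal kernels with covariance K, one per sample."""
    diff = np.asarray(x, dtype=float) - samples          # shape (N, 2)
    quad = np.einsum("ni,ij,nj->n", diff, np.linalg.inv(K), diff)
    kernels = np.exp(-0.5 * quad) / (2.0 * np.pi * np.sqrt(np.linalg.det(K)))
    return kernels.mean()   # division by N is an assumed normalization

K_circular = np.array([[1.0, 0.0], [0.0, 1.0]])    # circular kernel, a = 1
K_elongated = np.array([[1.0, 0.5], [0.5, 1.0]])   # kernel with off-diagonal 0.5

centre = table1.mean(axis=0)
d_circular = kernel_density(centre, table1, K_circular)
d_elongated = kernel_density(centre, table1, K_elongated)
far_away = kernel_density([6.0, -6.0], table1, K_circular)   # nearly zero
```

Evaluating `kernel_density` over a grid of points yields surfaces analogous to the level-set images, with the choice of kernel covariance controlling how strongly each sample is spread and in which direction.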


8 Concluding Remarks 


Multivariate statistics constitutes a particularly important area in science and technology, mainly because almost any problem that is treated deterministically can be extended to a statistical counterpart. It is also essential for dealing with real-world measurements and data.

In the present work, we developed an introductory presentation of some of the main concepts in multivariate statistics, including random vectors, joint probability distributions, joint variation measurements, and parametric and non-parametric estimation.

The research area of multivariate statistics is broad, with vast applications and interfaces with several other areas (e.g. signal processing, computer vision, biology, physics, etc.), and it is hoped that the reader will be motivated to probe further into it.
Costa’s Didactic Texts - CDTs

CDTs intend to be a halfway point between a formal scientific article and a dissemination text in the sense that they: (i) explain and illustrate concepts in a more informal, graphical and accessible way than the typical scientific article; and (ii) provide more in-depth mathematical developments than a more traditional dissemination work.

It is hoped that CDTs can also incorporate new insights and analogies concerning the reported concepts and methods. We hope these characteristics will contribute to making CDTs interesting both to beginners as well as to more senior researchers.

Each CDT focuses on a limited set of interrelated concepts. Though attempting to be relatively self-contained, CDTs also aim at being relatively short. Links to related material are provided in order to complement the covered subjects.

Observe that CDTs, which come with absolutely no warranty, are non distributable and for non-commercial use only.

The complete set of CDTs can be found at: https://www.researchgate.net/project/Costas-Didactic-Texts-CDTs